(54) METHOD AND DEVICE FOR VIDEO CLASSIFICATION

(57)Abstract: 

PROBLEM TO BE SOLVED: To provide a method and a device which analyze the
sound information included in video information and classify the video into
categories without being constrained by conventional categories.

SOLUTION: A music detecting section 103 conducts a frequency analysis of the
sound information of the inputted video information, detects the stability of the
spectra, and detects music. A voice detecting section 104 detects the harmonic
structure of the spectrum and detects voices. Meanwhile, a code table
generating section 105 generates a code table. An acoustic detecting section
106 then compares the sound information of the inputted video information with the
feature vectors of the code table, and the kinds of acoustics are detected by the
closeness of the distance between the sound information and the feature vectors.
An attribute information accumulating section 107 records the position of each
segment of the detected sound information. The kinds of the detected sound
information, the length of each segment, the total length of every kind, and the
pattern of the positions of the segments are extracted, and a video
discriminating section 108 discriminates the kind of the video information.



CLAIMS 



[Claim(s)] 

[Claim 1] An image classifying method which inputs video information, detects
from the sound information included in the inputted video information the
sections in which at least one of music, voice, and sound exists, and
distinguishes the kind of image from the occurrence patterns of the detected
sections, characterized by comprising the following.

A video input stage of carrying out A/D conversion and inputting digital video
information when the video information is analog.

An edge detection stage of conducting frequency analysis of the sound
information included in the video information and detecting the stability of a
spectrum.

A music detection stage of detecting music from the stability of the spectrum.

A voice detection stage of detecting the harmonic structure of the spectrum and
detecting voice.

A code book generation phase of vector-quantizing acoustic feature vectors as
learned data and generating a code book.

A sound detection stage of comparing the generated code book with a feature
vector of the sound information included in the video information and detecting
a sound at a near distance.

An attribution information accumulation stage of recording the position of each
section according to the kind of the detected sound information.

A stage according to image format of extracting one or more of the kind of the
detected sound information, the length of each section, the total length for
every kind, and the pattern of the positions of the sections, and distinguishing
the kind of the video information.

[Claim 2] The image classifying method according to claim 1, characterized in
that, in said edge detection stage, edges are detected with a differentiation
operator of the frequency direction from a spectrogram in which said spectra are
arranged in the time direction.

[Claim 3] The image classifying method according to claim 2, characterized in
that, in said music detection stage, music is detected from the strength of
edges of the time direction at constant frequencies of said spectrogram.

[Claim 4] The image classifying method according to claim 2 or 3, characterized
in that, in said voice detection stage, after portions of said spectrogram with
strong edges are removed, a radial fin type filter is used to detect harmonic
structure, and voice is detected.

[Claim 5] The image classifying method according to any one of claims 1, 2, 3,
and 4, characterized in that, in said sound detection stage, the distances
between the feature vectors of sound information containing only one kind of
sound, used as a reference sound, and the centers of gravity of said code book
are computed, and the distance between the center of gravity of the code book
that most frequently becomes the nearest and a feature vector of the sound
information included in said video information is used as the judging standard
of detection.

[Claim 6] The image classifying method according to any one of claims 1, 2, 3,
4, and 5, characterized in that said stage according to image format creates a
code book by making the kind of detected sound information, the length of each
section, the total length for every kind, and the position of each section into
a classification vector, and uses the distance between the center of gravity of
this code book and the classification vector of the sound information included
in said video information as the distinction standard.

[Claim 7] An image classifying apparatus which inputs video information, detects
from the sound information included in the inputted video information the
sections in which at least one of music, voice, and sound exists, and
distinguishes the kind of image from the occurrence patterns of the detected
sections, characterized by comprising the following.

A video input section which carries out A/D conversion and inputs digital video
information when the video information is analog.

An edge detection section which conducts frequency analysis of the sound
information included in the video information and detects the stability of a
spectrum.

A music primary detecting element which detects music from the stability of the
spectrum.

A voice detection part which detects the harmonic structure of the spectrum and
detects voice.

A code book generation part which vector-quantizes acoustic feature vectors as
learned data and generates a code book.

A sound primary detecting element which compares the generated code book with a
feature vector of the sound information included in the video information and
detects a sound at a near distance.

An attribution information accumulating part which records the position of each
section according to the kind of the detected sound information.

A part according to image format which extracts one or more of the kind of the
detected sound information, the length of each section, the total length for
every kind, and the pattern of the positions of the sections, and distinguishes
the kind of the video information.

[Claim 8] The image classifying apparatus according to claim 7, characterized in
that said edge detection section detects edges with a differentiation operator
of the frequency direction from a spectrogram in which said spectra are arranged
in the time direction.

[Claim 9] The image classifying apparatus according to claim 8, characterized in
that said music primary detecting element detects music from the strength of
edges of the time direction at constant frequencies of said spectrogram.

[Claim 10] The image classifying apparatus according to claim 8 or 9,
characterized in that said voice detection part, after removing portions of said
spectrogram with strong edges, uses a radial fin type filter to detect harmonic
structure, and detects voice.

[Claim 11] The image classifying apparatus according to any one of claims 7, 8,
9, and 10, characterized in that said sound primary detecting element computes
the distances between the feature vectors of sound information containing only
one kind of sound, used as a reference sound, and the centers of gravity of said
code book, and uses, as the judging standard of detection, the distance between
the center of gravity of the code book that most frequently becomes the nearest
and a feature vector of the sound information included in said video
information.

[Claim 12] The image classifying apparatus according to any one of claims 7, 8,
9, 10, and 11, characterized in that said part according to image format creates
a code book by making the kind of detected sound information, the length of each
section, the total length for every kind, and the position of each section into
a classification vector, and uses the distance between the center of gravity of
this code book and the classification vector of the sound information included
in said video information as the distinction standard.



DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention] In order to handle images efficiently, technology that
gives the attribution information of an image automatically is required.
Attribution information is used for editing, processing, classification, and so
on in fields related to image production. This invention relates to technology
that extracts the feature quantities contained in an image and classifies the
image according to those feature quantities.
[0002] 

[Description of the Prior Art] When a large number of images are handled
efficiently in a system such as video on demand, it is indispensable to classify
roughly what kind of contents each image has. At present, images are mainly
classified into news, sport, drama, movie, music, documentary, education,
variety, anime, and so on, and methods of identifying some of these
automatically have been proposed. In "S. Fischer et al.: Automatic Recognition
of Film Genres, ACM Multimedia '95, pp. 295-301", scene changes and camera
motion are detected from the color information of the picture and combined with
changes in the amplitude of the sound information to classify news, sport
(tennis and car race), anime, and commercials. Typical features regarded as
characteristic of each genre are used: few camera motions indicate news; a
repeated periodic sound (such as the sound of a tennis ball being hit) indicates
sport; little noise where the speech breaks off indicates anime (there are few
background sounds because of post-recording); and the whole picture turning
black at a scene change indicates commercials.

[0003]
[Problem(s) to be Solved by the Invention] In the above-mentioned conventional
art, images are classified mainly on the basis of picture information, and no
detailed analysis of the sound information is conducted. Since the features
peculiar to each genre that can be detected from picture information are
limited, the range that can be classified is narrow. Moreover, with the top-down
method described above, which looks for features defined in advance for each
genre, some genres cannot be classified.

[0004] On the other hand, the sound information included in an image reflects
the contents of the image well, and features peculiar to the kind of contents
are easy to detect from it. By analyzing the sound information, detecting the
characteristic sounds found in images in general, and classifying an image from
their occurrence patterns, it is possible to realize a classifying method that
incorporates a bottom-up element.

[0005] The purpose of this invention is to provide an image classifying method
and device which analyze the sound information included in video information and
classify an image into categories that are not bound by the existing genres.
[0006] 

[Means for Solving the Problem] In order to attain the above-mentioned purpose,
the image classifying method of this invention has: a video input stage of
carrying out A/D conversion and inputting digital video information when the
video information is analog; a music detection stage of conducting frequency
analysis of the sound information included in the video information, detecting
the stability of a spectrum, and detecting music; a voice detection stage of
detecting the harmonic structure of the spectrum and detecting voice; a code
book generation phase of vector-quantizing acoustic feature vectors as learned
data and generating a code book; a sound detection stage of comparing the
feature vectors of the generated code book with the sound information included
in the video information and detecting a sound at a near distance; an
attribution information accumulation stage of recording the position of each
section according to the kind of the detected sound information; and a stage
according to image format of extracting one or more of the kind of the detected
sound information, the length of each section, the total length for every kind,
and the pattern of the positions of the sections, and distinguishing the kind of
the video information. This makes it possible to detect, from the sound
information included in the inputted video information, the sections in which at
least one of music, voice, and sound exists, to distinguish the kind of image
from the occurrence patterns of the detected sections, and to classify images
into a wide range of categories.
[0007] The image classifying apparatus of this invention has: a video input
section which carries out A/D conversion and inputs digital video information
when the video information is analog; a music primary detecting element which
conducts frequency analysis of the sound information included in the video
information, detects the stability of a spectrum, and detects music; a voice
detection part which detects the harmonic structure of the spectrum and detects
voice; a code book generation part which vector-quantizes acoustic feature
vectors as learned data and generates a code book; a sound primary detecting
element which compares the feature vectors of the generated code book with the
sound information included in the video information and detects a sound at a
near distance; an attribution information accumulating part which records the
position of each section according to the kind of the detected sound
information; and a part according to image format which extracts one or more of
the kind of the detected sound information, the length of each section, the
total length for every kind, and the pattern of the positions of the sections,
and distinguishes the kind of the video information. This makes it possible to
detect, from the sound information included in the inputted video information,
the sections in which at least one of music, voice, and sound exists, to
distinguish the kind of image from the occurrence patterns of the detected
sections, and to classify images into a wide range of categories.

[0008] In the above image classifying method and device, detecting the strength
of edges of the time direction at constant frequencies of a spectrogram makes it
possible to detect music easily.

[0009] By removing the portions of this spectrogram with strong edges and then
using a radial fin type filter to detect harmonic structure, it becomes possible
to detect voice easily even when music overlaps it.

[0010] The distances between the feature vectors of sound information containing
only one kind of sound, used as a reference sound, and the centers of gravity of
the code book are computed, and the distance between the center of gravity that
most frequently becomes the nearest and a feature vector of the sound
information included in the video information is used as the judging standard of
detection. This makes it possible to detect the learned sound easily.

[0011] A code book is created by making the kind of detected sound information,
the length of each section, the total length for every kind, and the position of
each section into a classification vector, and the distance between the center
of gravity of this code book and the classification vector of the sound
information included in the video information is used as the distinction
standard. This makes it possible to classify an image easily.
[0012] 

[Embodiment of the Invention] Next, an embodiment of the invention is described
in detail with reference to the drawings.

[0013] Drawing 1 is a block diagram showing the outline configuration of the
image classifying apparatus of one embodiment of this invention.
[0014] The image classifying apparatus of this embodiment consists of: the video
input section 101, which carries out A/D conversion when the inputted video
information is analog; the edge detection section 102, which conducts frequency
analysis of the sound information, detects the edges of the sound spectrogram,
and removes them if needed; the music primary detecting element 103, which
detects music from the sound information; the voice detection part 104, which
detects voice; the code book generation part 105, which generates a code book
from acoustic learned data; the sound primary detecting element 106, which
detects sounds of the same kind as the learned sounds; the attribution
information accumulating part 107, which records the positions of the sections
of the detected sounds; and the part 108 according to image format, which
distinguishes the kind of video information from the kind of detected sound
information, the length of each section, the total length for every kind, and
the position of each section.

[0015] The sound data of the image inputted from the video input section 101 is,
on the one hand, inputted into the edge detection section 102, where FFT (Fast
Fourier Transform) processing is carried out and a sound spectrogram about
several seconds long is generated. Here, it is also possible to use LPC (linear
predictive coding) instead of FFT. On the other hand, the sound data of the
image inputted from the video input section 101 is also inputted into the sound
primary detecting element 106.

[0016] Drawing 2 is a flow chart which shows the processing of the edge
detection section 102, the music primary detecting element 103, and the voice
detection part 104 of one embodiment of this invention. Hereafter, examples of
their operation are explained with reference to drawing 1 and drawing 2.
[0017] A spectrogram is generated by the FFT processing 201 of the edge
detection section 102. The frame length in that case is several tens to 100
milliseconds, and the detection interval is several seconds.
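
The spectrogram generation of the FFT processing 201 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `spectrogram`, the Hann window, the hop size, and the naive DFT (used here only to keep the example free of external libraries) are all assumptions that do not appear in the source.

```python
import cmath
import math

def spectrogram(samples, frame_len, hop):
    """Return a list of magnitude spectra, one per frame (time-major)."""
    # Assumption: a Hann window; the source does not name a window function.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (frame_len - 1))
              for k in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + k] * window[k] for k in range(frame_len)]
        spectrum = []
        # Naive DFT over bins 0 .. frame_len/2; a real FFT would be used in practice.
        for i in range(frame_len // 2 + 1):
            acc = sum(frame[k] * cmath.exp(-2j * math.pi * i * k / frame_len)
                      for k in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Usage: a 1 kHz tone sampled at 8 kHz. The 64-sample frame (8 ms) is shorter
# than the tens of milliseconds named in the text, only to keep the demo fast.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(512)]
spec = spectrogram(tone, frame_len=64, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda i: spec[0][i])
```

Each bin spans 8000 / 64 = 125 Hz, so the stable tone shows up in the same bin in every frame, which is the kind of locus the later music detection relies on.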

[0018] Drawing 3 shows the generated spectrogram in simplified form. A
spectrogram is actually obtained as a grayscale image. 301 is a locus of the
spectrum of a music component, and 302 is a locus of the spectrum of a voice
component. Since a music component draws a locus that is stable in the frequency
direction, it is detected using this property. First, the edge EDi of the time
direction at the frequency i is detected using a differentiation operator in the
edge detection process 202. In the threshold process 203 of edge, the value of
the obtained edge EDi is compared with threshold TH1. When the value of the edge
EDi is larger than threshold TH1, the spectrum of the frequency i is set to 0 in
the edge elimination and interpolation processing 204, as pretreatment for voice
detection, and the edge is eliminated. The eliminated spectrum is then linearly
interpolated using the values of the nearby spectrum. This processing is
repeated for all bands. In the repetition decision processing 205, the
repetition ends when i becomes equal to n-1. Here, n is the number of points in
the FFT frame.
[0019] Next, the total of the edge strengths is computed by the edge intensity
calculation processing 206, and in the threshold process 207 of edge intensity,
when the computed edge strength is larger than threshold TH2, it is judged that
music exists.
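
The edge detection, edge elimination with interpolation, and music judgment described above can be sketched as follows. The sketch rests on stated assumptions: the source does not give the kernel of the differentiation operator, so a discrete second difference along the frequency axis, averaged over time, stands in for EDi; the function names and the threshold values are hypothetical.

```python
def edge_strengths(spec):
    """ED_i per frequency bin i, for a time-major magnitude spectrogram."""
    n_bins = len(spec[0])
    ed = [0.0] * n_bins
    for i in range(1, n_bins - 1):
        # Assumption: second difference across frequency, averaged over frames.
        ridge = sum(max(0.0, 2 * f[i] - f[i - 1] - f[i + 1]) for f in spec)
        ed[i] = ridge / len(spec)
    return ed

def interpolate_strong_bins(spec, ed, th1):
    """Replace bins with ED_i > TH1 by linear interpolation from nearby bins."""
    n_bins = len(spec[0])
    strong = [e > th1 for e in ed]
    cleaned = [list(f) for f in spec]
    for i in range(n_bins):
        if not strong[i]:
            continue
        lo = i - 1
        while lo >= 0 and strong[lo]:
            lo -= 1
        hi = i + 1
        while hi < n_bins and strong[hi]:
            hi += 1
        for t, f in enumerate(spec):
            if lo >= 0 and hi < n_bins:
                w = (i - lo) / (hi - lo)
                cleaned[t][i] = (1 - w) * f[lo] + w * f[hi]
            elif lo >= 0:
                cleaned[t][i] = f[lo]
            elif hi < n_bins:
                cleaned[t][i] = f[hi]
            else:
                cleaned[t][i] = 0.0
    return cleaned

def music_present(ed, th2):
    """Music is judged present when the total edge strength exceeds TH2."""
    return sum(ed) > th2

# Usage: four frames with a stable component in bin 3.
spec = [[1.0, 1.0, 1.0, 5.0, 1.0, 1.0] for _ in range(4)]
ed = edge_strengths(spec)
cleaned = interpolate_strong_bins(spec, ed, th1=4.0)
has_music = music_present(ed, th2=5.0)
```

In this toy input the stable bin is flattened to the level of its neighbours, leaving a spectrogram in which the music component no longer masks the striped voice pattern.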

[0020] A voice component appears as a striped pattern at equal intervals that
changes over time, as shown by 302 in drawing 3. Therefore, in parallel with the
edge intensity calculation processing 206, radial-fin-type-filter processing 208
is applied to the spectrogram, and in the threshold process 209 of a filter
output, if the output of the filtering is larger than threshold TH3, it is
judged that voice exists.
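
One way to picture the detection of equally spaced stripes is a comb-shaped filter over the frequency axis. The sketch below is an assumption made for illustration: the source names only a "radial fin type filter" and does not specify its shape, so the peak-minus-valley comb, the range of candidate spacings, and the names `comb_output` and `voice_present` are hypothetical.

```python
def comb_output(frame, min_spacing, max_spacing):
    """Best comb response of one spectrum over candidate harmonic spacings (in bins)."""
    best = 0.0
    for spacing in range(min_spacing, max_spacing + 1):
        teeth = range(spacing, len(frame), spacing)
        if len(teeth) < 2:
            continue
        # Mean energy on the comb teeth (expected harmonic positions).
        on = sum(frame[i] for i in teeth) / len(teeth)
        # Mean energy halfway between the teeth (expected valleys).
        gaps = [i - spacing // 2 for i in teeth]
        off = sum(frame[i] for i in gaps) / len(gaps)
        best = max(best, on - off)
    return best

def voice_present(spec, th3, min_spacing=2, max_spacing=8):
    """Voice is judged present when the mean filter output exceeds TH3."""
    mean_out = sum(comb_output(f, min_spacing, max_spacing) for f in spec) / len(spec)
    return mean_out > th3, mean_out

# Usage: one spectrum with harmonics every 4 bins, one flat spectrum.
harmonic = [1.0 if i % 4 == 0 and i > 0 else 0.0 for i in range(33)]
flat = [0.5] * 33
voiced, voiced_out = voice_present([harmonic] * 3, th3=0.5)
unvoiced, unvoiced_out = voice_present([flat] * 3, th3=0.5)
```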
[0021] Drawing 4 is a flow chart which shows the processing of the sound primary
detecting element 106 of drawing 1 in one embodiment of this invention. Examples
of the kinds of sound include ****, a cheer, applause, bustle, and the sound of
machinery. Here, the explanation takes ****, a cheer, and applause as examples.

[0022] Since no clear structure appears in their spectra, sounds such as ****, a
cheer, and applause are detected using vector quantization. First, samples of
each kind of sound data are prepared, and a code book is created by the code
book generation part 105. As the feature quantity of the vectors, linear
predictor coefficients of about 16 dimensions, computed with a frame length of
several tens to 100 milliseconds, are used. It is also possible to use LPC
cepstrum, FFT cepstrum, filter bank outputs, and so on. The larger the sample
data, the better the result. In order to classify into the three categories of
****, a cheer, and applause, three or more clusters are generated from the
coefficients of the sample data. Below, the case where the number of clusters is
three is explained. First, the center-of-gravity vectors of the clusters are
denoted C1, C2, and C3. Which of ****, a cheer, and applause each of C1, C2, and
C3 corresponds to is easily found by checking which center-of-gravity vector is
nearest to sample data whose category is known.
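
The code book generation and the labeling of the center-of-gravity vectors can be sketched as follows. This is a sketch under assumptions: plain k-means with a deterministic initialization stands in for whichever vector quantization training the embodiment uses, the 2-dimensional toy vectors replace the roughly 16-dimensional linear predictor coefficients, and the function names and category labels in the usage are hypothetical.

```python
import math
from collections import Counter

def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

def build_codebook(vectors, k, iterations=20):
    """Plain k-means; an assumed stand-in for the VQ training of the source."""
    step = len(vectors) // k
    centroids = [list(vectors[j * step]) for j in range(k)]  # deterministic init
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def label_centroids(centroids, vectors, labels):
    """Label each centroid with the category whose known samples most often land nearest."""
    votes = [Counter() for _ in centroids]
    for v, lab in zip(vectors, labels):
        votes[nearest(v, centroids)][lab] += 1
    return [cnt.most_common(1)[0][0] if cnt else None for cnt in votes]

# Usage with toy 2-D features and known categories.
vectors = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels = ["cheer", "cheer", "applause", "applause", "machinery", "machinery"]
centroids = build_codebook(vectors, k=3)
centroid_labels = label_centroids(centroids, vectors, labels)
```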

[0023] The linear predictor coefficients of the inputted sound data of the image
are computed by the linear-predictor-coefficients calculation processing 401,
and the distance Li to each center-of-gravity vector is computed by the vector
distance calculation processing 402. Next, in the threshold process 403 of a
shortest distance vector, the size of the distance Li to the center-of-gravity
vectors is examined; when it is larger than threshold TH4, the data is judged
not to belong to any of the three categories and is judged as non-sound. When it
is smaller than threshold TH4, the center-of-gravity vector with the shortest
distance Li is chosen by the shortest distance vector discrimination processing
404 and the shortest distance vector discrimination processing 405, and the data
is judged to belong to the corresponding category. Drawing 4 shows the case
where C1, C2, and C3 correspond to ****, a cheer, and applause, respectively.
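
The thresholded nearest-centroid decision can be sketched as follows. The function name `classify_frame`, the toy centroids, the labels, and the value of TH4 are assumptions for illustration; Euclidean distance is also an assumption, since the source does not name the distance measure.

```python
import math

def classify_frame(feature, centroids, labels, th4):
    """Return the category of the nearest centroid, or None for non-sound."""
    distances = [math.dist(feature, c) for c in centroids]
    best = min(range(len(distances)), key=distances.__getitem__)
    if distances[best] > th4:
        return None  # farther than TH4 from every centroid: judged as non-sound
    return labels[best]

# Usage with hypothetical 2-D centroids C1, C2, C3.
centroids = [[0.0, 0.5], [5.0, 5.5], [10.0, 0.5]]
labels = ["cheer", "applause", "machinery"]
near = classify_frame((5.2, 5.4), centroids, labels, th4=2.0)
far = classify_frame((20.0, 20.0), centroids, labels, th4=2.0)
```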

[0024] The positions of the starting point and terminal point of each sound
detected in the feature sound primary detecting element 102 are recorded in the
attribution information accumulating part 107 as a part of the attribution
information, in a format such as a time code or the number of bytes from the
head.
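
Recording the starting point and terminal point of each detected sound can be sketched as follows. The sketch assumes one detection result per detection interval and positions expressed in seconds; the function name `to_segments` and the kind labels are hypothetical, and the simplification to a single label per interval does not cover overlapping sounds.

```python
def to_segments(interval_labels, interval_sec):
    """Merge runs of equal labels into (kind, start_sec, end_sec) records."""
    segments = []
    start = 0
    for idx in range(1, len(interval_labels) + 1):
        run_ended = (idx == len(interval_labels)
                     or interval_labels[idx] != interval_labels[start])
        if run_ended:
            if interval_labels[start] is not None:  # None marks "nothing detected"
                segments.append((interval_labels[start],
                                 start * interval_sec, idx * interval_sec))
            start = idx
    return segments

# Usage: one label per 2-second detection interval.
detected = ["music", "music", None, "voice", "voice", "voice", "music"]
segments = to_segments(detected, interval_sec=2.0)
```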

[0025] In the part 108 according to image format, information is read from the
attribution information accumulating part 107, the content of each sound in the
whole picture sequence is computed, and the classification vector
V = (v1, v2, v3, v4, v5, v6) is obtained. Here, v1, v2, v3, v4, v5, and v6 are
the content of music, voice, ****, a cheer, applause, and sections in which
voice overlaps music, respectively.
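
The computation of the classification vector V from the recorded sections can be sketched as follows. The sketch assumes that "content" means the fraction of the total duration occupied by each kind, that sections are stored as (kind, start, end) in seconds, and that sections of the same kind do not overlap each other. The names are hypothetical, and the third kind is kept as the literal placeholder "****" because the source does not name it.

```python
KINDS = ["music", "voice", "****", "cheer", "applause"]

def overlap_length(a_sections, b_sections):
    """Total time during which a section of A and a section of B coincide."""
    total = 0.0
    for a_start, a_end in a_sections:
        for b_start, b_end in b_sections:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def classification_vector(sections, total_sec):
    """Return [v1..v6]: share of each kind, then the music/voice overlap share."""
    by_kind = {k: [] for k in KINDS}
    for kind, start, end in sections:
        by_kind[kind].append((start, end))
    v = [sum(e - s for s, e in by_kind[k]) / total_sec for k in KINDS]
    v.append(overlap_length(by_kind["music"], by_kind["voice"]) / total_sec)
    return v

# Usage: a 100-second sequence.
sections = [("music", 0, 40), ("voice", 30, 80), ("applause", 90, 100)]
v = classification_vector(sections, total_sec=100)
```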

[0026] When an image is classified using the classification vector, vector
quantization is used, as in sound detection. Classification vectors are obtained
from various image samples, they are clustered into as many clusters as the
number of required genres, and the center-of-gravity vectors are obtained. The
distance between the classification vector of the inputted image and each
center-of-gravity vector is computed, and the image is assigned to the nearest
cluster. Although the clusters formed do not necessarily agree with the genres
generally used, classification is possible such as news or education when there
is little music and much voice, and, conversely, comedy and the like when there
is much music and ****.

[0027] Drawing 5 is a flow chart which shows the processing when the image
classifying apparatus of this embodiment is realized by software. First, in the
code book generation phase 500, a code book is generated from acoustic learned
data. An image is inputted in the video input stage 501, and frequency analysis
and edge detection are performed in the edge detection stage 502. Deletion of
edges and interpolation are performed if needed. In the music detection stage
503 and the voice detection stage 504, music and voice are detected using the
strength of edges and a radial fin type filter, respectively. In the sound
detection stage 505, ****, a cheer, and applause are detected using vector
quantization. The information on the starting point and terminal point of each
detected sound is accumulated in the attribution information accumulation stage
506, and when the end of the picture sequence is reached, the image is
classified in the stage 507 according to image format.
[0028] 

[Effect of the Invention] As explained above, this invention has the following
effects.

[0029] (1) Since music, voice, ****, a cheer, and applause are detected from the
sound information included in video information, and the kind of detected sound
information, the length of each section, the total length for every kind, and
the pattern of the positions of the sections are compared, an image can be
classified into a wide range of categories.

[0030] (2) In particular, when the strength of edges of the time direction at
constant frequencies of a spectrogram is detected, music can be detected easily.
[0031] (3) In particular, when a radial fin type filter is used to detect
harmonic structure after the portions of a spectrogram with strong edges are
removed, voice, such as speech, can be detected easily.

[0032] (4) In particular, when the distances between the feature vectors of
sound information containing only one kind of sound, used as a reference sound,
and the centers of gravity of the code book are computed, and the distance
between the center of gravity that most frequently becomes the nearest and the
feature vector of the sound information included in the video information is
used as the judging standard of detection, the learned sound can be detected
easily.

[0033] (5) In particular, when a code book is created by making the kind of
detected sound information, the length of each section, the total length for
every kind, and the position of each section into a classification vector, and
the distance between the center of gravity of this code book and the
classification vector of the sound information included in the video information
is used as the judging standard of distinction, an image can be easily
classified into a wide range of categories.



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] It is a block diagram showing the outline configuration of the image
classifying apparatus of one embodiment of this invention.
[Drawing 2] It is a flow chart which shows the detection processing of music and
voice in the feature sound detection section of the above-mentioned embodiment.

[Drawing 3] It is a conceptual diagram showing the sound spectrogram obtained in
the edge detection section of the above-mentioned embodiment.

[Drawing 4] It is a flow chart which shows the detection processing of ****, a
cheer, and applause in the feature sound detection section of the
above-mentioned embodiment.

[Drawing 5] It is a flow chart which shows the flow of processing when the image
classifying apparatus of the above-mentioned embodiment is realized by software
using a computer.
[Description of Notations] 

101 — Video input section 

102 — Edge detection section 

103 — Music primary detecting element 

104 — Voice detection part 

105 — Code book generation part

106 — Sound primary detecting element 

107 — Attribution information accumulating part

108 — Part according to image format

201 — FFT (Fast Fourier Transform) processing 

202 — Edge detection process 

203 — Threshold process of edge 

204 — Edge elimination and interpolation processing

205 — Repetition decision processing 

206 — Edge intensity calculation processing 

207 — Threshold process of edge intensity 

208 -- Radial-fin-type-filter processing 

209 — Threshold process of a filter output 

301 — Music spectral peak 

302 — Voice spectral peak 

401 — Linear-predictor-coefficients calculation processing 

402 — Vector distance calculation processing 

403 — Threshold process of a shortest distance vector 

404 — Shortest distance vector discrimination processing 

405 — Shortest distance vector discrimination processing 

500 — Code book generation phase 

501 — Video input stage 

502 — Edge detection stage 

503 — Music detection stage 

504 — Voice detection stage 

505 — Sound detection stage 

506 — Attribution information accumulation stage 

507 — Stage according to image format 



